Using latent semantic analysis to identify similarities in source code to support program understanding
Authors
Abstract
The paper describes the results of applying Latent Semantic Analysis (LSA), an advanced information retrieval method, to program source code and associated documentation. LSA is a corpus-based statistical method for inducing and representing aspects of the meanings of words and passages (of natural language) as reflected in their usage. The method is assessed for application to the domain of software components (i.e., source code and its accompanying documentation). Here, LSA is used as the basis for clustering software components, and the resulting clusters are used to assist in understanding a nontrivial software system, namely a version of Mosaic. Applying LSA to source code and internal documentation in support of program understanding is a new application of the method and a departure from its usual application domain of natural language.
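As a rough illustration of the idea in the abstract, the sketch below builds a term-by-component matrix, reduces it with a truncated singular value decomposition, and compares components by cosine similarity in the reduced space. It is not the authors' implementation: the toy `components` strings, the crude tokenizer, the raw-count weighting, and the choice `k=2` are all assumptions made for this example.

```python
import re
import numpy as np


def tokenize(text):
    """Lowercase and split on non-letters (a crude stand-in for
    extracting words from identifiers and comments)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


def lsa_document_vectors(docs, k):
    """Return one k-dimensional LSA vector per document."""
    vocab = sorted({t for d in docs for t in tokenize(d)})
    index = {t: i for i, t in enumerate(vocab)}

    # Term-by-document matrix of raw counts (assumed weighting).
    counts = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for t in tokenize(d):
            counts[index[t], j] += 1.0

    # Truncated SVD: keep only the k largest singular values.
    _, s, vt = np.linalg.svd(counts, full_matrices=False)
    k = min(k, len(s))
    return (np.diag(s[:k]) @ vt[:k, :]).T  # shape: (n_docs, k)


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid division by zero
    unit = vectors / norms
    return unit @ unit.T


# Hypothetical "software components", reduced to their vocabulary.
components = [
    "open socket connect host send http request",
    "http request header send socket host",
    "draw widget window paint pixel color",
    "paint window color widget redraw",
]

sim = cosine_similarity_matrix(lsa_document_vectors(components, k=2))
print(np.round(sim, 2))
```

Components whose similarity exceeds some threshold could then be grouped into a cluster; the threshold and the clustering procedure are left open here because the abstract does not specify them.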
Similar resources
Support for Software Maintenance Using Latent Semantic Analysis
The paper describes the results of applying semantic (versus structural) methods to the problems of software maintenance and program comprehension. Here, the focus is on tools that assist programmers in understanding large legacy software systems. The method applied, Latent Semantic Analysis, is a corpus-based statistical method for inducing and representing aspects of the meanings of words and passa...
Semantic clustering: Identifying topics in source code
Many of the existing approaches in Software Comprehension focus on program structure or external documentation. However, by analyzing only formal information, the informal semantics contained in the vocabulary of source code are overlooked. To understand software as a whole, we need to enrich software analysis with the developer knowledge hidden in the code naming. This paper proposes the use...
Automatic Software Clustering via Latent Semantic Analysis
This paper appears in the 14th IEEE ASE’99, Cocoa Beach, FL, Oct. 12-15, pp. 251-254. Abstract: The paper describes the initial results of applying Latent Semantic Analysis (LSA) to program source code and associated documentation. Latent Semantic Analysis is a corpus-based statistical method for inducing and representing aspects of the meanings of words and passages (of natural language) reflecti...
Identification of High-Level Concept Clones in Source Code
Source code duplication occurs frequently within large software systems. Pieces of source code, functions, and data types are often duplicated in part, or in whole, for a variety of reasons. Programmers may simply be reusing a piece of code via copy and paste or they may be “reinventing the wheel”. Previous research on the detection of clones is mainly focused on identifying pieces of code with...
Using Traceability Links to Assess and Maintain the Quality of Software Documentation
The paper proposes an approach for using traceability links to assess and maintain the quality of software documentation. Our position is that quality documentation should accurately reflect the structure of the source code; hence elements of documentation that link to strongly coupled elements of the source code should also be strongly related. We use latent semantic indexing (LSI) to compute ...